Abstract. A two-parameter family of exchangeable partitions with a simple updating
rule is introduced. The partition is identified with a randomized version of a standard
symmetric Dirichlet species-sampling model with finitely many types. A power-like
distribution for the number of types is derived.

1. Introduction

The Ewens-Pitman two-parameter family of exchangeable partitions $\Pi^{\alpha,\theta}$ of an infinite
set has become a central model for species sampling (see [7], [18] for extensive background
on exchangeability and properties of these partitions). One of the most attractive features of
this model is the following explicit rule of succession, which we formulate as sequential
allocation of balls labelled $1, 2, \dots$ in a series of boxes. Start with box $B_{1,1}$ containing the single
ball 1. At step $n$ the allocation of $n$ balls is a certain random partition $\Pi_n^{\alpha,\theta}$ of the set of
balls $[n] := \{1, \dots, n\}$ into some number $K_n$ of nonempty boxes $B_{n,1}, \dots, B_{n,K_n}$, which we
identify with their contents, and list the boxes by increase of the minimal labels of balls¹.
Given that at step $n$ the number of occupied boxes is $K_n = k$ and the occupancy counts are
$\#B_{n,j} = n_j$ for $1 \le j \le k$ (so $n_1 + \dots + n_k = n$), the partition $\Pi_{n+1}^{\alpha,\theta}$ of $[n+1]$ at step
$n + 1$ is obtained by randomly placing ball $n + 1$ according to the rules

$(O^{\alpha,\theta})$: in an old box $B_{n+1,j} := B_{n,j} \cup \{n+1\}$ with probability

\[ \omega_{n,j}(k; n_1, \dots, n_k) := \frac{n_j - \alpha}{n + \theta}, \qquad j = 1, \dots, k, \]

$(N^{\alpha,\theta})$: in a new box $B_{n+1,k+1} := \{n+1\}$ with probability

\[ \nu_n(k; n_1, \dots, n_k) := \frac{\theta + k\alpha}{n + \theta}. \]

For instance, if at step $n = 6$ the partition is $\{1, 3\}, \{2, 5, 6\}, \{4\}$, then ball 7 is added to
one of the old boxes $\{1, 3\}$ or $\{2, 5, 6\}$ or $\{4\}$ with probabilities specified by $(O^{\alpha,\theta})$, and a
new box $\{7\}$ is created according to $(N^{\alpha,\theta})$.
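
The succession rule can be tried out numerically. The following is a minimal sketch, assuming the standard Ewens-Pitman probabilities $(n_j - \alpha)/(n + \theta)$ for an old box and $(\theta + k\alpha)/(n + \theta)$ for a new box; the function names and the parameter values $\alpha = 1/2$, $\theta = 1$ are illustrative choices, not taken from the text.

```python
import random
from fractions import Fraction

def ewens_pitman_probs(counts, alpha, theta):
    """Return (old-box probabilities, new-box probability) for ball n+1."""
    n, k = sum(counts), len(counts)
    old = [(nj - alpha) / (n + theta) for nj in counts]
    new = (theta + k * alpha) / (n + theta)
    return old, new

def allocate(n_balls, alpha, theta, rng):
    """Allocate balls 1..n_balls sequentially; boxes appear in order of minimal label."""
    boxes = [[1]]
    for ball in range(2, n_balls + 1):
        old, new = ewens_pitman_probs([len(b) for b in boxes], alpha, theta)
        j = rng.choices(range(len(boxes) + 1), weights=old + [new])[0]
        if j == len(boxes):
            boxes.append([ball])   # rule (N): open a new box
        else:
            boxes[j].append(ball)  # rule (O): join old box j
    return boxes

# Example from the text: boxes {1,3}, {2,5,6}, {4} at step n = 6 (exact arithmetic).
old, new = ewens_pitman_probs([2, 3, 1], Fraction(1, 2), Fraction(1))
# old is [3/14, 5/14, 1/14] and new is 5/14 for these illustrative parameters.
boxes = allocate(20, 0.5, 1.0, random.Random(0))
```

Exact fractions are used for the one-step probabilities so that the total can be checked to equal 1 without rounding error; the simulation itself uses floats.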
1 Which means that $\min([n] \setminus (\cup_{i=1}^{j-1} B_{n,i})) \in B_{n,j}$ for $1 \le j \le K_n$.

Eventually, as all balls $1, 2, \dots$ get allocated in boxes $B_j := \cup_{n \in \mathbb{N}} B_{n,j}$, the collection of
occupied boxes is almost surely infinite if the parameters $(\alpha, \theta)$ are in the range $\{(\alpha, \theta) :
0 \le \alpha < 1, \theta > -\alpha\}$. In contrast to that, the collection of occupied boxes has finite
cardinality $\varkappa$ if $\alpha < 0$ and $-\theta/\alpha = \varkappa$; this model, which is most relevant to the present
study, has a long history going back to Fisher [3].

Exchangeable Gibbs partitions [HI [9] extend the Ewens-Pitman family². The first rule
is preserved in the sense that, given ball $n + 1$ is placed in one of the old boxes, it is
placed in box $j$ with probability $\omega_{n,j}$ still proportional to $n_j - \alpha$, where $\alpha < 1$ is a fixed
genus of the partition. But the second rule allows more general functions $\nu_n$ of $n$ and $k$,
which agree with the first rule and the exchangeability of the partition. Examples of Gibbs
partitions of genus $\alpha \in (0, 1)$ were studied in |HJ EH E]; from the results of these papers
one can extract complicated formulas expressing $\nu_n$ in terms of special functions. See
[15] for a survey of related topics and applications of random partitions to Bayesian
nonparametric inference.

The practitioner willing to adopt a partition from the Ewens-Pitman family as a
species-sampling model faces a dilemma: the total number of boxes is either a fixed finite
number (Fisher's subfamily) or it is infinite. For applications it is desirable to also have
tractable exchangeable partitions of $\mathbb{N}$ with a finite but random number of boxes $K$. The
present note suggests a two-parameter family of partitions of the latter kind, which are
Gibbs partitions obtained by suitable mixing of Fisher's $\Pi^{-1,\theta}$-partitions over $\theta$, where the
randomized $\theta$ (which will be re-denoted $\varkappa := \theta$) actually coincides with $K$. Equivalently,
the partition can be generated by sampling from a mixture of symmetric Dirichlet random
measures with unit weights on $\varkappa$ points.



2. Construction of the partition 

A new allocation rule is as follows. Start with box $B_{1,1}$ containing a single ball 1. At
step $n$ the allocation of $n$ balls is a certain random partition $\Pi_n = (B_{n,1}, \dots, B_{n,K_n})$ of
the set of balls $[n]$. Given that the number of boxes is $K_n = k$ and the occupancy counts are
$\#B_{n,j} = n_j$ for $1 \le j \le k$, the partition of $[n+1]$ at step $n + 1$ is obtained by randomly
placing ball $n + 1$

(O) : in an old box $B_{n+1,j} := B_{n,j} \cup \{n+1\}$ with probability

\[ \omega_{n,j}(k; n_1, \dots, n_k) := \frac{(n_j + 1)(n - k + \gamma)}{n^2 + \gamma n + \zeta}, \qquad j = 1, \dots, k, \]

(N) : in a new box $B_{n+1,k+1} := \{n+1\}$ with probability

\[ \nu_n(k; n_1, \dots, n_k) := \frac{k^2 - \gamma k + \zeta}{n^2 + \gamma n + \zeta}. \]



2 Here we are only interested in infinite exchangeable Gibbs partitions. Finite Gibbs partitions of $[n]$
were discussed in [Hid], but these are typically not consistent as $n$ varies.

To agree with the rules of probability the parameters $\gamma$ and $\zeta$ must be chosen so that
$\gamma \ge 0$ and (i) either $k^2 - \gamma k + \zeta$ is (strictly) positive for all $k \in \mathbb{N}$, or (ii) the quadratic
is positive for $k \in \{1, \dots, k_0 - 1\}$ and has a root at $k_0$. In case (ii) the number
of occupied boxes never exceeds $k_0$. Part (O) is similar to the $(O^{\alpha,\theta})$-prescription with
$\alpha = -1$: given ball $n + 1$ is placed in one of the old boxes, it is placed in box $j$ with
probability proportional to $n_j + 1$. But part (N) is radically different from $(N^{\alpha,\theta})$ in that
the probability of creating a new box is a ratio of quadratic polynomials in $k$ and $n$.
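
To see the rules (O) and (N) at work, here is a minimal sketch, assuming the old-box probability $(n_j + 1)(n - k + \gamma)/(n^2 + \gamma n + \zeta)$ and the new-box probability $(k^2 - \gamma k + \zeta)/(n^2 + \gamma n + \zeta)$. The function names and the parameter values are hypothetical; the pair $\gamma = 13/2$, $\zeta = 21/2$ is chosen only because the quadratic $k^2 - \gamma k + \zeta$ then has a root at $k_0 = 3$, illustrating case (ii).

```python
import random
from fractions import Fraction

def new_rule_probs(counts, gamma, zeta):
    """Return (old-box probabilities, new-box probability) under rules (O) and (N)."""
    n, k = sum(counts), len(counts)
    denom = n * n + gamma * n + zeta
    old = [(nj + 1) * (n - k + gamma) / denom for nj in counts]
    new = (k * k - gamma * k + zeta) / denom
    return old, new

def simulate_counts(n_balls, gamma, zeta, rng):
    """Run the allocation for n_balls balls and return the occupancy counts."""
    counts = [1]
    for _ in range(n_balls - 1):
        old, new = new_rule_probs(counts, gamma, zeta)
        j = rng.choices(range(len(counts) + 1), weights=old + [new])[0]
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
    return counts

# One step from occupancy counts (2, 3, 1), in exact arithmetic.
old, new = new_rule_probs([2, 3, 1], Fraction(1, 2), Fraction(0))
# Case (ii): with three boxes occupied, no fourth box can be created.
capped_new = new_rule_probs([1, 1, 1], Fraction(13, 2), Fraction(21, 2))[1]
counts = simulate_counts(1000, 0.5, 0.0, random.Random(1))
```

The old-box numerators sum to $(n + k)(n - k + \gamma)$, which together with $k^2 - \gamma k + \zeta$ gives the common denominator, so the probabilities add up to 1 for any admissible parameters.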

The probability of every particular partition $B_{n,1}, \dots, B_{n,K_n}$ with $K_n = k$ boxes
containing $n_1, \dots, n_k$ balls is easily calculated as

\[ p(n_1, \dots, n_k) = \frac{(\gamma)_{n-k} \prod_{i=1}^{k-1} (i^2 - \gamma i + \zeta)}{\prod_{m=1}^{n-1} (m^2 + \gamma m + \zeta)} \prod_{j=1}^{k} n_j! \tag{1} \]

(with $(a)_m := a(a+1) \cdots (a+m-1)$), where $(n_1, \dots, n_k)$ is an arbitrary composition
of integer $n$, that is a vector of some length $k \in \mathbb{N}$ whose components $n_j \in \mathbb{N}$ satisfy
$\sum_{j=1}^{k} n_j = n$. The function $p$ is sometimes called exchangeable partition probability
function (EPPF) [18]. For instance, the probability that the set of balls $[6]$ is allocated after
completing step 6 in three boxes $\{1, 3\}, \{2, 5, 6\}, \{4\}$ is equal to $p(2, 3, 1)$. Formula (1) is
a familiar Gibbs form of exchangeable partition of genus $\alpha = -1$ (see [HI [IE] and Section
8). An exchangeable partition $\Pi$ of the infinite set of balls $\mathbb{N}$ is defined as the allocation
of balls in boxes $B_j := \cup_{n \in \mathbb{N}} B_{n,j}$, with the convention $B_j = \varnothing$ in the event $K_n < j$ for all
$n$. For $\gamma = 0$ the partition $\Pi$ has only singleton boxes.
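
The closed form can be checked against the step-by-step allocation. Below is a minimal sketch, assuming the EPPF $p(n_1, \dots, n_k) = (\gamma)_{n-k} \prod_{i<k} (i^2 - \gamma i + \zeta) \prod_j n_j! / \prod_{m<n} (m^2 + \gamma m + \zeta)$ and the rules (O), (N) of this section; the helper names and the values $\gamma = 1/2$, $\zeta = 1/3$ are illustrative.

```python
from fractions import Fraction
from math import factorial, prod

def rising(a, m):
    """Rising factorial (a)_m = a (a+1) ... (a+m-1), with (a)_0 = 1."""
    return prod((a + i for i in range(m)), start=Fraction(1))

def eppf(counts, gamma, zeta):
    """Closed-form probability of a particular partition with the given box sizes."""
    n, k = sum(counts), len(counts)
    num = rising(gamma, n - k) * prod(
        (i * i - gamma * i + zeta for i in range(1, k)), start=Fraction(1))
    den = prod((m * m + gamma * m + zeta for m in range(1, n)), start=Fraction(1))
    return num / den * prod(factorial(c) for c in counts)

def sequential_probability(box_of_ball, gamma, zeta):
    """Multiply the (O)/(N) probabilities along one allocation of balls 1..n."""
    counts, prob = [], Fraction(1)
    for ball, j in enumerate(box_of_ball, start=1):
        n, k = sum(counts), len(counts)
        if ball > 1:
            denom = n * n + gamma * n + zeta
            if j == k:
                prob *= (k * k - gamma * k + zeta) / denom            # rule (N)
            else:
                prob *= (counts[j] + 1) * (n - k + gamma) / denom     # rule (O)
        if j == k:
            counts.append(1)
        else:
            counts[j] += 1
    return prob

gamma, zeta = Fraction(1, 2), Fraction(1, 3)
# Boxes {1,3}, {2,5,6}, {4}: balls 1..6 go to boxes 0, 1, 0, 2, 1, 1.
step_by_step = sequential_probability([0, 1, 0, 2, 1, 1], gamma, zeta)
closed_form = eppf([2, 3, 1], gamma, zeta)
```

Both numbers are exact fractions, so agreement is an identity rather than a floating-point coincidence.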
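The formula for

\[ v_{n,k} := \frac{(\gamma)_{n-k} \prod_{i=1}^{k-1} (i^2 - \gamma i + \zeta)}{\prod_{m=1}^{n-1} (m^2 + \gamma m + \zeta)} \]

can be fully split in linear factors as

\[ v_{n,k} = \frac{(\gamma)_{n-k} (s_1 + 1)_{k-1} (s_2 + 1)_{k-1}}{(z_1 + 1)_{n-1} (z_2 + 1)_{n-1}} \tag{2} \]

by factoring the quadratics as

\[ x^2 + \gamma x + \zeta = (x + z_1)(x + z_2), \qquad x^2 - \gamma x + \zeta = (x + s_1)(x + s_2), \]

for some complex $z_1, z_2, s_1, s_2$.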

Using exchangeability and applying Equation (20) from [16], the total probability that
the occupancy counts are $(n_1, \dots, n_k)$ equals

\[ P(K_n = k, \#B_{n,1} = n_1, \dots, \#B_{n,K_n} = n_k)
   = \frac{n!}{\prod_{j=1}^{k} (n_j - 1)! \, (n_j + \dots + n_k)} \, p(n_1, \dots, n_k)
   = v_{n,k} \, n! \prod_{j=1}^{k} \frac{n_j}{n_j + \dots + n_k}. \]
Let $K_{n,r} = \#\{1 \le j \le K_n : \#B_{n,j} = r\}$ be the number of boxes occupied by exactly $r$
out of $n$ balls. By standard counting arguments, the last formula can be re-written as

\[ P(K_{n,r} = k_r, \ r = 1, \dots, n) = v_{n,k} \, n! \prod_{r=1}^{n} \frac{1}{k_r!} \]

for arbitrary integer vector of multiplicities $(k_1, \dots, k_n)$, with $k_r \ge 0$, $\sum_{r=1}^{n} k_r = k$ and
$\sum_{r=1}^{n} r k_r = n$.

3. Mixture representation and the number of occupied boxes 

Like for any Gibbs partition of genus $-1$, the number of occupied boxes $K_n$ is a
sufficient statistic for the finite partition $\Pi_n$, meaning that conditionally given $K_n = k$ the
probability of each particular value of $\Pi_n$ with occupancy counts $n_1, \dots, n_k$ equals

\[ \frac{\prod_{j=1}^{k} n_j!}{d_{n,k}}, \]

where the normalization constant is a Lah number [3]

\[ d_{n,k} = \binom{n-1}{k-1} \frac{n!}{k!}. \tag{3} \]

The sequence $(K_n, n = 1, 2, \dots)$ is a nondecreasing Markov chain with 0-1 increments and
transition probabilities determined by the rule (N). The distribution of $K_n$ is calculated
as

\[ P(K_n = k) = d_{n,k} \, v_{n,k}. \tag{4} \]
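
The product form of the law of $K_n$ can be compared with a direct forward computation for the Markov chain. This is a minimal sketch, assuming $P(K_n = k) = d_{n,k} v_{n,k}$ with the Lah number $d_{n,k} = \binom{n-1}{k-1} n!/k!$ and with $v_{n,k}$ as the non-factorial part of the EPPF; the function names and parameter values are illustrative.

```python
from fractions import Fraction
from math import comb, factorial, prod

def lah(n, k):
    """Lah number d_{n,k} = C(n-1, k-1) * n! / k!."""
    return comb(n - 1, k - 1) * factorial(n) // factorial(k)

def v(n, k, gamma, zeta):
    """Weight v_{n,k}: (gamma)_{n-k} * prod_{i<k}(i^2 - gamma i + zeta) / prod_{m<n}(m^2 + gamma m + zeta)."""
    num = prod((gamma + i for i in range(n - k)), start=Fraction(1))
    num *= prod((i * i - gamma * i + zeta for i in range(1, k)), start=Fraction(1))
    return num / prod((m * m + gamma * m + zeta for m in range(1, n)), start=Fraction(1))

def law_of_K_n(n, gamma, zeta):
    """Distribution of K_n from the product formula."""
    return [lah(n, k) * v(n, k, gamma, zeta) for k in range(1, n + 1)]

def law_by_markov_chain(n, gamma, zeta):
    """Distribution of K_n by iterating the new-box probability of rule (N)."""
    law = {1: Fraction(1)}
    for m in range(1, n):
        nxt = {}
        for k, pr in law.items():
            up = (k * k - gamma * k + zeta) / (m * m + gamma * m + zeta)
            nxt[k + 1] = nxt.get(k + 1, 0) + pr * up
            nxt[k] = nxt.get(k, 0) + pr * (1 - up)
        law = nxt
    return [law.get(k, Fraction(0)) for k in range(1, n + 1)]

gamma, zeta = Fraction(1, 2), Fraction(1, 3)
closed = law_of_K_n(8, gamma, zeta)
forward = law_by_markov_chain(8, gamma, zeta)
```

The forward recursion uses only the fact that the new-box probability depends on the partition through $(n, k)$, which is what makes $(K_n)$ a Markov chain.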

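By monotonicity, the limit $K := \lim_{n \to \infty} K_n$ exists almost surely, and coincides with
the number of nonempty boxes for the infinite partition $\Pi$. Letting $n \to \infty$ in (4) and
using the standard asymptotics $\Gamma(n + a)/\Gamma(n + b) \sim n^{a-b}$ (together with $z_1 + z_2 = \gamma$) we derive from (2), (3)

\[ P(K = \varkappa) = \frac{\Gamma(z_1 + 1) \Gamma(z_2 + 1)}{\Gamma(\gamma)} \, \frac{(s_1 + 1)_{\varkappa-1} (s_2 + 1)_{\varkappa-1}}{\varkappa! \, (\varkappa - 1)!}, \qquad \varkappa = 1, 2, \dots \tag{5} \]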

The basic structural result about $\Pi$ is the following:

Theorem 1. Partition $\Pi$ is a mixture of partitions $\Pi^{-1,\varkappa}$ over the parameter $\varkappa$, with a
proper mixing distribution given by (5).

Proof. As every other Gibbs partition of genus $-1$, partition $\Pi$ satisfies the conditioning
relation

\[ \Pi \mid \{K = \varkappa\} \stackrel{d}{=} \Pi^{-1,\varkappa}, \tag{6} \]

which says that given $K$ the partition has the same distribution as some Fisher's partition
of genus $-1$. We only need to verify that the weights in (5) add up to unity.

To avoid calculus, note that by the general theory [9] $\Pi$ is a mixture of the $\Pi^{-1,\varkappa}$'s
with $\varkappa \in \mathbb{N}$, and the trivial singleton partition. Because every $\Pi^{-1,\varkappa}$ has $\varkappa$ boxes, the
probability $P(K = \infty)$ is equal to the weight of the singleton component in the mixture.
But for the singleton partition of $[n]$ the number of boxes is equal to the number of balls,
thus it remains to check that $P(K_n = n) \to 0$ as $n \to \infty$, which is easily done by inspection
of the transition rule (N). □

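As $\varkappa \to \infty$, the masses (5) exhibit a power-like decay,

\[ P(K = \varkappa) \sim c \, \varkappa^{-\gamma-1} \qquad \text{with} \qquad c = \frac{\Gamma(z_1 + 1) \Gamma(z_2 + 1)}{\Gamma(\gamma) \Gamma(s_1 + 1) \Gamma(s_2 + 1)}. \]

This explains, to an extent, the role of parameter $\gamma$. In particular, $\mathbb{E} K$ may be finite or
infinite, depending on whether $\gamma > 1$ or $\gamma \le 1$.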



4. Frequencies 

Recall some standard facts about the partition $\Pi^{-1,\varkappa}$ (see [18]). This partition with $\varkappa$
boxes $B_1, \dots, B_\varkappa$ can be generated by the following steps:

(b) choose a value $(y_1, \dots, y_\varkappa)$ for the probability vector $(\tilde{P}_{\varkappa,1}, \dots, \tilde{P}_{\varkappa,\varkappa})$ uniformly
distributed on the $(\varkappa - 1)$-simplex $\{(y_1, \dots, y_\varkappa) : y_i > 0, \sum_{i=1}^{\varkappa} y_i = 1\}$,

(c) allocate balls $1, 2, \dots$ independently in $\varkappa$ boxes with probabilities $y_1, \dots, y_\varkappa$ of
placing a ball in each of these boxes,

(d) arrange the boxes by increase of the smallest labels of balls.

The vector of frequencies $(P_{\varkappa,1}, \dots, P_{\varkappa,\varkappa})$, defined through limit proportions

\[ P_{\varkappa,j} = \lim_{n \to \infty} \frac{\#(B_j \cap [n])}{n}, \tag{7} \]

has the same distribution as the size-biased permutation of $(\tilde{P}_{\varkappa,1}, \dots, \tilde{P}_{\varkappa,\varkappa})$. The
frequencies have a convenient stick-breaking representation

\[ P_{\varkappa,j} = W_j \prod_{i=1}^{j-1} (1 - W_i) \qquad \text{with independent } W_i \stackrel{d}{=} \mathrm{beta}(2, \varkappa - i), \tag{8} \]

where $i = 1, \dots, \varkappa$ and beta(2, 0) is a Dirac mass at 1. See [7] for characterizations
of this and other Ewens-Pitman partitions through independence of factors in such a
stick-breaking representation.
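
The stick-breaking representation translates directly into a sampler. The following is a minimal sketch, assuming the frequencies are $W_j \prod_{i<j} (1 - W_i)$ with independent $W_i$ distributed as beta$(2, \varkappa - i)$ and $W_\varkappa = 1$; the function name is hypothetical and Python's standard `random.betavariate` supplies the beta variables.

```python
import random

def stick_breaking_frequencies(kappa, rng):
    """Sample size-biased frequencies of kappa boxes by stick-breaking."""
    freqs, remaining = [], 1.0
    for i in range(1, kappa + 1):
        # beta(2, 0) is read as a Dirac mass at 1: the last box takes what is left.
        w = 1.0 if i == kappa else rng.betavariate(2, kappa - i)
        freqs.append(w * remaining)
        remaining *= 1.0 - w
    return freqs

freqs = stick_breaking_frequencies(5, random.Random(3))

# Monte Carlo sanity check: the first factor is beta(2, kappa - 1) with mean 2 / (kappa + 1).
rng = random.Random(11)
mean_first = sum(stick_breaking_frequencies(4, rng)[0] for _ in range(20000)) / 20000
```

The check on the mean uses only the fact that a beta$(a, b)$ variable has mean $a/(a + b)$, so for $\varkappa = 4$ the empirical mean should be close to 0.4.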

Now let us apply the above to the partition $\Pi$. The mixture representation in Theorem
1 implies that $\Pi$ can be constructed by first

(a) choosing a value $\varkappa$ for $K$ from distribution (5),

then following the above steps (b), (c) and (d). The frequencies $(P_1, \dots, P_K)$ of nonempty
boxes $B_1, \dots, B_K$ are obtainable from (8) by mixing with weights (5).
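
Steps (a) to (d) can be sketched in code for a finite number of balls. This is a minimal sketch under two loudly stated assumptions: step (a) uses the special case $\zeta = 0$, where the mixing weights are $P(K = \varkappa) = \gamma (1 - \gamma)_{\varkappa-1} / \varkappa!$, and the heavy tail of $K$ is truncated at a hypothetical cap `max_kappa` so that the search always terminates. The function names are illustrative.

```python
import random

def sample_kappa(gamma, rng, max_kappa=10_000):
    """Step (a) for zeta = 0: inverse-transform sampling, truncated at max_kappa (an assumption)."""
    u, p, cdf = rng.random(), gamma, 0.0
    for kappa in range(1, max_kappa + 1):
        cdf += p
        if u <= cdf:
            return kappa
        p *= (kappa - gamma) / (kappa + 1)  # ratio of consecutive weights
    return max_kappa

def partition_given_kappa(n_balls, kappa, rng):
    """Steps (b)-(d): uniform weights on the simplex, i.i.d. allocation, boxes by smallest ball."""
    e = [rng.expovariate(1.0) for _ in range(kappa)]
    total = sum(e)
    y = [x / total for x in e]                                  # (b) normalized exponentials
    labels = rng.choices(range(kappa), weights=y, k=n_balls)    # (c)
    boxes = {}
    for ball, label in enumerate(labels, start=1):
        boxes.setdefault(label, []).append(ball)                # (d) insertion order = order of appearance
    return list(boxes.values())

rng = random.Random(7)
kappa = sample_kappa(0.5, rng)
boxes = partition_given_kappa(30, kappa, rng)
```

Normalizing independent exponential variables is a standard way to draw a uniform point on the simplex; any other Dirichlet(1, ..., 1) sampler would serve equally well.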

5. Exchangeable sequences

Let $(S, \mathcal{B}, \mu)$ be a Polish space with a nonatomic probability measure $\mu$. Let $\Pi$ be the
partition of $\mathbb{N}$ constructed above and $T_1, T_2, \dots$ be an i.i.d. sample from $(S, \mathcal{B}, \mu)$, also
independent of $\Pi$. With these random objects one naturally associates an infinite
exchangeable $S$-valued sequence $X_1, X_2, \dots$ with marginal distributions $\mu$, as follows (see
[H [12]). Attach to every ball in box $B_j$ the same tag $T_j$, for $j = 1, \dots, K$. Then define
$X_1, X_2, \dots$ to be the sequence of tags of balls $1, 2, \dots$.

Obviously, $K, T_1, \dots, T_K$ and $\Pi$ can be recovered from $X_1, X_2, \dots$. Indeed, $T_j$ is the $j$th
distinct value in the sequence $X_1, X_2, \dots$ and $B_j = \{n : X_n = T_j\}$ for $j = 1, \dots, K$. The
same applies to finite partitions $\Pi_n$ with $B_{n,j} = B_j \cap [n]$.

The prediction rule [12] associated with $X_1, X_2, \dots$ is the formula for conditional
distribution

\[ P(X_{n+1} \in ds \mid X_1, \dots, X_n) = \sum_{j=1}^{K_n} \omega_{n,j} \, \delta_{T_j}(ds) + \nu_n \, \mu(ds), \]

where $T_1, \dots, T_{K_n}$ are the distinct values in $X_1, \dots, X_n$, and $\omega_{n,j}, \nu_n$ are the functions of
the partition $\Pi_n$, as specified by the rules (O) and (N).

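The random measure $F$ in de Finetti's representation of $X_1, X_2, \dots$ is a mixture

\[ F = \sum_{\varkappa=1}^{\infty} P(K = \varkappa) \, F_\varkappa, \]

where

\[ F_\varkappa(ds) = \sum_{j=1}^{\varkappa} \tilde{P}_{\varkappa,j} \, \delta_{T_j}(ds), \qquad \varkappa \in \mathbb{N}, \]

are Dirichlet$(1, \dots, 1)$ random measures (with $\varkappa$ unit parameters) on $(S, \mathcal{B}, \mu)$, that is the vector
$(\tilde{P}_{\varkappa,1}, \dots, \tilde{P}_{\varkappa,\varkappa})$ is
uniformly distributed on the $(\varkappa - 1)$-simplex and is independent of $(T_1, T_2, \dots)$, and the
random variables $T_j$'s are i.i.d. $(\mu)$.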

6. The case $\zeta = 0$

We focus now on the case $\zeta = 0$. Then $\gamma \in [0, 1]$ is the admissible range, but we shall
exclude the trivial edge cases $\gamma = 0$, respectively $\gamma = 1$, of the singleton and single-box
partitions.

Formula (2) simplifies as

\[ v_{n,k} = \frac{(k-1)! \, (1 - \gamma)_{k-1} \, (\gamma)_{n-k}}{(n-1)! \, (1 + \gamma)_{n-1}}, \]

and there is a further obvious cancellation of some factors. Furthermore, (5) specializes
as

\[ P(K = \varkappa) = \frac{\gamma (1 - \gamma)_{\varkappa-1}}{\varkappa!}, \qquad \varkappa = 1, 2, \dots \tag{9} \]

which is a distribution familiar from the discrete renewal theory (a summary is found in
[17], p. 85). The distribution has also appeared in connection with $\Pi^{\alpha,\theta}$ $(0 < \alpha < 1)$
partitions and other occupancy problems [H El [10], [17].
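
The weights of this distribution are easy to tabulate exactly. Here is a minimal sketch, assuming $P(K = \varkappa) = \gamma (1 - \gamma)_{\varkappa-1} / \varkappa!$; the tail identity $P(K > N) = (1 - \gamma)_N / N!$ used in the check is not stated in the text but follows by induction on $N$, and the function names are illustrative.

```python
from fractions import Fraction
from math import factorial, prod

def rising(a, m):
    """Rising factorial (a)_m with (a)_0 = 1."""
    return prod((a + i for i in range(m)), start=Fraction(1))

def pmf_K(kappa, gamma):
    """P(K = kappa) in the case zeta = 0."""
    return gamma * rising(1 - gamma, kappa - 1) / factorial(kappa)

def tail_K(N, gamma):
    """P(K > N) = (1 - gamma)_N / N!, obtained by telescoping the weights."""
    return rising(1 - gamma, N) / factorial(N)

gamma = Fraction(1, 2)
partial_sum = sum(pmf_K(x, gamma) for x in range(1, 51))
```

Since the tail decays like $N^{-\gamma}$, the weights add up to 1, but slowly; this is the same power-like behaviour that makes the mean of $K$ infinite for $\gamma < 1$.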

Thinking of (9) as a prior distribution for $K$, the posterior distribution is found from
(4), (9) and the distribution of the number of occupied boxes for the $\Pi^{-1,\varkappa}$ partition
(instance of Equation (3.11) in [18]):

\[ P(K = \varkappa \mid K_n = k) = \frac{(n-1)!}{(k-1)! \, (\varkappa + n - 1)!} \prod_{i=1}^{k-1} (\varkappa - i) \prod_{j=1}^{k} (\gamma + n - j) \prod_{l=k}^{\varkappa-1} (l - \gamma) \tag{10} \]

for $1 \le k \le n$, $\varkappa \ge k$. Note that the conditioning here can be replaced by conditioning
on an arbitrary value of the partition $\Pi_n$ with $k$ boxes.
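
The posterior can be cross-checked by carrying out Bayes' rule explicitly. This is a minimal sketch for $\zeta = 0$, assuming the closed form $\frac{(n-1)!}{(k-1)!(\varkappa+n-1)!} \prod_{i<k} (\varkappa - i) \prod_{j \le k} (\gamma + n - j) \prod_{l=k}^{\varkappa-1} (l - \gamma)$, the prior $\gamma (1 - \gamma)_{\varkappa-1}/\varkappa!$, and the standard Ewens-Pitman weight $\prod_{i<k} (\varkappa - i) / (\varkappa + 1)_{n-1}$ for Fisher's partition with $\alpha = -1$, $\theta = \varkappa$. The Lah number cancels in the ratio and is omitted; all function names are hypothetical.

```python
from fractions import Fraction
from math import factorial, prod

def rising(a, m):
    """Rising factorial (a)_m with (a)_0 = 1."""
    return prod((a + i for i in range(m)), start=Fraction(1))

def posterior_closed_form(kappa, n, k, gamma):
    """Posterior P(K = kappa | K_n = k) from the closed formula (zeta = 0)."""
    value = Fraction(factorial(n - 1), factorial(k - 1) * factorial(kappa + n - 1))
    value *= prod((kappa - i for i in range(1, k)), start=Fraction(1))
    value *= prod((gamma + n - j for j in range(1, k + 1)), start=Fraction(1))
    value *= prod((l - gamma for l in range(k, kappa)), start=Fraction(1))
    return value

def posterior_by_bayes(kappa, n, k, gamma):
    """Prior times Fisher likelihood, divided by the marginal weight v_{n,k}."""
    prior = gamma * rising(1 - gamma, kappa - 1) / factorial(kappa)
    fisher = prod((kappa - i for i in range(1, k)), start=Fraction(1)) / rising(kappa + 1, n - 1)
    marginal = rising(gamma, n - k) * prod((i * (i - gamma) for i in range(1, k)), start=Fraction(1))
    marginal /= prod((m * (m + gamma) for m in range(1, n)), start=Fraction(1))
    return prior * fisher / marginal

gamma = Fraction(1, 2)
cases = [(2, 2, 2), (5, 4, 3), (7, 6, 1), (3, 3, 3)]
agree = all(posterior_closed_form(x, n, k, gamma) == posterior_by_bayes(x, n, k, gamma)
            for x, n, k in cases)
```

For $n = 1$, $k = 1$ the closed form reduces to the prior, as it should, since the first ball carries no information about $K$.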
The frequency $P_1$ of box $B_1$ has distribution

\[ P(P_1 \in dy) = \sum_{\varkappa=1}^{\infty} \frac{\gamma (1 - \gamma)_{\varkappa-1}}{\varkappa!} \, P(P_{\varkappa,1} \in dy) = \gamma \, \delta_1(dy) + (1 - \gamma) \gamma y^{\gamma-1} dy, \qquad y \in (0, 1], \]

which is a mixture of Dirac mass at 1 and beta$(\gamma, 1)$ density. Interestingly, distributions of
this kind have appeared in connection with other partition-valued processes [HI [11]. The
distribution is useful to compute expected values of symmetric statistics of the frequencies
of the kind $\sum_{j=1}^{K} f(P_j)$ [12], for example

\[ \mathbb{E} \Big( \sum_{j=1}^{K} P_j^{n+1} \Big) = \mathbb{E} (P_1^{n}) = \frac{(n+1) \gamma}{n + \gamma}, \]

which agrees with the $P(K_{n+1} = 1)$ instance of (1).
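
The agreement between the moment of $P_1$ and the single-box probability can be verified in exact arithmetic. Below is a minimal sketch for $\zeta = 0$, assuming that $P_1$ has law $\gamma \delta_1 + (1 - \gamma)\,$beta$(\gamma, 1)$ and that the new-box probability from one occupied box at step $m$ is $(1 - \gamma)/(m^2 + \gamma m)$; the function names and the value $\gamma = 1/3$ are illustrative.

```python
from fractions import Fraction
from math import prod

def moment_P1(r, gamma):
    """E[P_1^r]: the atom at 1 contributes gamma, the beta(gamma, 1) part gamma / (r + gamma)."""
    return gamma + (1 - gamma) * gamma / (r + gamma)

def prob_single_box(n, gamma):
    """P(K_n = 1): balls 2..n all avoid opening a new box."""
    return prod((1 - (1 - gamma) / (m * m + gamma * m) for m in range(1, n)),
                start=Fraction(1))

gamma = Fraction(1, 3)
checks = [(moment_P1(n, gamma), prob_single_box(n + 1, gamma), (n + 1) * gamma / (n + gamma))
          for n in range(1, 11)]
```

Each triple in `checks` should consist of three equal fractions: the moment, the product of stay probabilities, and the closed form $(n + 1)\gamma/(n + \gamma)$.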



7. Restricted exchangeability 

It is of interest to explore a more general situation when the process starts with some
initial allocation of a few balls in boxes. This can be thought of as prior information of
the observer about the existing species. For simplicity we shall only consider the case
$\zeta = 0$.

Fix $m \ge 1$ and a partition $b = (b_1, \dots, b_k)$ of $[m]$ with $k$ positive box-sizes $\#b_j =
m_j$, $j = 1, \dots, k$. Let $P_b$ be the law of the infinite partition $\Pi$ constructed by the rules
(O) and (N) starting with the initial allocation of balls $\Pi_m = b$. In particular, $P = P_{\{1\}}$.
Note that $P_b$ is well defined for any value of the parameter in the range

\[ -(m - k) < \gamma < k, \]

and for $\gamma \in (0, 1)$ the measure $P_{\{1\}}$, conditioned on $\{\Pi_m = b\}$, coincides with $P_b$.
Explicitly, under $P_b$ every value of $\Pi_n = (B_{n,1}, \dots, B_{n,K_n})$ with

\[ K_n = \varkappa \ge k, \qquad \#B_{n,j} = n_j, \ j = 1, \dots, \varkappa; \qquad n_i \ge m_i, \ i = 1, \dots, k \]

has probability

\[ p_b(n_1, \dots, n_\varkappa) = \frac{p(n_1, \dots, n_\varkappa)}{p(m_1, \dots, m_k)}, \]

where $p$ is given by (1). Formula (9) for the terminal distribution of the number of boxes
is still valid for the extended range of $\gamma$.

Observing that $p_b$ is symmetric in the arguments $n_j$ for $k < j \le \varkappa$, it follows that $P_b$ is
invariant under permutations of the set $\mathbb{N} \setminus [m]$. On the other hand, for every permutation
$\sigma : [m] \to [m]$ we have $P_{\sigma b} = \sigma P_b$. Moreover, the restriction of $\Pi$ on $\mathbb{N} \setminus [m]$ under $P_b$ has
the same law as under $P_{\sigma b}$, that is the restriction depends on $b$ only through $(m_1, \dots, m_k)$.

Examples. Suppose $\gamma = 1$. Then $P_{\{1\}}(K = 1) = 1$, which corresponds to the trivial
one-block partition, but $P_{\{1\},\{2\}}(K = \varkappa) = \frac{2}{\varkappa(\varkappa+1)}$ for $\varkappa \ge 2$.

Suppose $\gamma = 0$. Then $\Pi$ under $P_{\{1\}}$ is the trivial singleton partition, but under $P_{\{1,2\}}$
we have $P_{\{1,2\}}(K = \varkappa) = \frac{1}{\varkappa(\varkappa+1)}$ for $\varkappa \ge 1$.

8. General Gibbs partitions and the new family 

Both the Ewens-Pitman family and the partitions introduced in this note can be
constructed in a unified way, using simple algebraic identities. Recall from [9], [18] that the
Gibbs form for EPPF $p$ of genus $\alpha \in (-\infty, 1)$ is³

\[ p(n_1, \dots, n_k) = v_{n,k} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1}, \]

where the triangular array $(v_{n,k})$ is nonnegative and satisfies the recursion

\[ v_{n,k} = (n - k\alpha) \, v_{n+1,k} + v_{n+1,k+1}, \qquad 1 \le k \le n \tag{11} \]

with normalization $v_{1,1} = 1$. The recursion goes backwards, from $n + 1$ to $n$, thus it cannot
be 'solved' in a unique way; rather, it has a convex set of solutions, each corresponding to the
distribution of some exchangeable partition.

For a Gibbs partition the number of occupied boxes $(K_n, n = 1, 2, \dots)$ is a nondecreasing
Markov chain, viewed conveniently as a bivariate space-time walk $(n, K_n)$, which has
backward transition probabilities depending on $\alpha$ but not on $(v_{n,k})$. The backward
transition probabilities are determined from the conditioning relation: given $(n, K_n) = (n, k)$,
the probability of each admissible path from $(1, 1)$ to $(n, k)$ is proportional to the product
of weights along the path, where the weight of transition $(n, k) \to (n + 1, k)$ is $n - k\alpha$,
and that of $(n, k) \to (n + 1, k + 1)$ is 1. The normalizing total sum $d_{n,k}(\alpha)$ of such
products over the paths from $(1, 1)$ to $(n, k)$ is known as a generalized Stirling number [3].
Each particular solution to (11) determines the law of $(K_n)$ via the marginal distributions
$P(K_n = k) = v_{n,k} \, d_{n,k}(\alpha)$.
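
Both the backward recursion and the path sums can be checked mechanically for the family of this paper. The following is a minimal sketch for genus $\alpha = -1$, assuming the recursion $v_{n,k} = (n - k\alpha) v_{n+1,k} + v_{n+1,k+1}$, the weights $v_{n,k} = (\gamma)_{n-k} \prod_{i<k} (i^2 - \gamma i + \zeta) / \prod_{m<n} (m^2 + \gamma m + \zeta)$, and path weights $n - k\alpha$ (stay) and 1 (up). The function names and parameter values are illustrative.

```python
from fractions import Fraction
from math import comb, factorial, prod

def path_sums(n_max, alpha):
    """d_{n,k}(alpha): total weight of paths from (1,1) to (n,k), built forwards in n."""
    d = {(1, 1): 1}
    for n in range(1, n_max):
        for k in range(1, n + 2):
            d[(n + 1, k)] = (n - k * alpha) * d.get((n, k), 0) + d.get((n, k - 1), 0)
    return d

def v(n, k, gamma, zeta):
    """Weights v_{n,k} of the new family."""
    num = prod((gamma + i for i in range(n - k)), start=Fraction(1))
    num *= prod((i * i - gamma * i + zeta for i in range(1, k)), start=Fraction(1))
    return num / prod((m * m + gamma * m + zeta for m in range(1, n)), start=Fraction(1))

gamma, zeta, alpha = Fraction(1, 2), Fraction(1, 3), -1
d = path_sums(8, alpha)

# Backward recursion for the weights.
recursion_ok = all(
    v(n, k, gamma, zeta)
    == (n - k * alpha) * v(n + 1, k, gamma, zeta) + v(n + 1, k + 1, gamma, zeta)
    for n in range(1, 8) for k in range(1, n + 1))

# For alpha = -1 the path sums are the Lah numbers C(n-1, k-1) n! / k!.
lah_ok = all(d[(n, k)] == comb(n - 1, k - 1) * factorial(n) // factorial(k)
             for n in range(1, 9) for k in range(1, n + 1))

total_mass = sum(v(6, k, gamma, zeta) * d[(6, k)] for k in range(1, 7))
```

The last line computes $\sum_k v_{6,k} d_{6,k}(-1)$, the total mass of the law of $K_6$, which should equal 1 for any solution of the recursion normalized by $v_{1,1} = 1$.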

Large-$n$ properties of Gibbs partitions depend on $\alpha$. In particular, there exists an
almost-sure limit $K = \lim_{n \to \infty} K_n / c_n(\alpha)$, where $c_n(\alpha) = n^\alpha$ for $\alpha \in (0, 1)$, $c_n(0) = \log n$
and $c_n(\alpha) = 1$ for $\alpha < 0$. The law of $K$ is characteristic for a partition of given genus.
That is to say, a generic Gibbs partition is a unique mixture over $\varkappa$ of extreme partitions
for which $K = \varkappa$ a.s. Note that $K$ has continuous range for $\alpha \in [0, 1)$, and discrete for
$\alpha < 0$. For $\alpha < 0$ the extremes are Fisher's partitions $\Pi^{\alpha,-\alpha\varkappa}$. For $\alpha = 0$ the extremes
are Ewens' partitions $\Pi^{0,\varkappa}$ (with $\varkappa \in [0, \infty]$). For $\alpha \in (0, 1)$ Ewens-Pitman partitions
are not extreme, rather the extremes are obtainable by conditioning any $\Pi^{\alpha,\theta}$ on $K = \varkappa$;
the $v_{n,k}$'s for these extreme partitions were identified in [13] in terms of the generalized
hypergeometric functions.

3 We omit here the case $\alpha = -\infty$.

Following [5], where recursions akin to (11) were treated, one can seek for special
solutions of the form

\[ v_{n,k} = \frac{\prod_{i=1}^{n-k} f(i-1) \, \prod_{j=1}^{k-1} g(j)}{\prod_{m=1}^{n-1} h(m)}, \tag{12} \]

where $f, g, h : \mathbb{N} \to \mathbb{R}$ (with $f$ also evaluated at 0) satisfy the identity

\[ (n - \alpha k) f(n - k) + g(k) = h(n), \qquad 1 \le k \le n, \ n \in \mathbb{N}. \tag{13} \]

Moreover, $f, h$ must be (strictly) positive on $\mathbb{N}$, while $g$ may be either positive on $\mathbb{N}$ or
positive on some integer interval $\{1, \dots, k_0 - 1\}$ with $g(k_0) = 0$. Each such triple defines
a Gibbs partition with the 'new boxes' updating rule of the form

\[ \nu_n(k; n_1, \dots, n_k) = P(K_{n+1} = k + 1 \mid K_n = k) = \frac{g(k)}{h(n)} \]

(where $n = n_1 + \dots + n_k$), complemented by the associated version of the (O)-rule (as in
Sections 1 and 2).

Now we can review two instances of (12):

• Exploiting the identity $n - \alpha k + \alpha k + \theta = n + \theta$ we may choose $f(n) = 1$, $g(n) = \alpha n +
\theta$ and $h(n) = n + \theta$. This yields the Ewens-Pitman partitions with the succession
rule $(N^{\alpha,\theta})$. Note that the admissible range for $\alpha, \theta$ is determined straightforwardly
from the positivity.

• The identity

\[ (n - k + \gamma)(n + k) + k^2 - \gamma k + \zeta = n^2 + \gamma n + \zeta \]

is of the kind (13) with $\alpha = -1$. We choose $f(n) = n + \gamma$, $g(n) = n^2 - \gamma n + \zeta$ and
$h(n) = n^2 + \gamma n + \zeta$ to arrive at the partitions introduced in this paper.
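
The two identities above can be confirmed on a grid of integers. This is a minimal sketch, assuming the identity $(n - \alpha k) f(n - k) + g(k) = h(n)$ for $1 \le k \le n$; the helper name `satisfies_identity` and all parameter values are hypothetical, and the last line is a deliberate negative control with the wrong genus.

```python
from fractions import Fraction

def satisfies_identity(f, g, h, alpha, n_max=12):
    """Check (n - alpha k) f(n - k) + g(k) == h(n) for all 1 <= k <= n <= n_max."""
    return all((n - alpha * k) * f(n - k) + g(k) == h(n)
               for n in range(1, n_max + 1) for k in range(1, n + 1))

# Ewens-Pitman triple: f = 1, g(k) = alpha k + theta, h(n) = n + theta.
a, theta = Fraction(1, 3), Fraction(2, 5)
ewens_pitman_ok = satisfies_identity(
    lambda x: 1, lambda k: a * k + theta, lambda n: n + theta, alpha=a)

# Triple of the new family, of genus alpha = -1.
gamma, zeta = Fraction(1, 2), Fraction(1, 3)
f = lambda x: x + gamma
g = lambda k: k * k - gamma * k + zeta
h = lambda n: n * n + gamma * n + zeta
new_family_ok = satisfies_identity(f, g, h, alpha=-1)

# Negative control: the same triple does not satisfy the identity for alpha = 0.
wrong_genus_ok = satisfies_identity(f, g, h, alpha=0)
```

A grid check is of course no proof, but both identities are polynomial in $n$ and $k$ of degree at most two, so agreement on a grid of this size already forces them to hold identically.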

It is natural to wonder if there are any other Gibbs partitions of the form (12).

The ansatz (12) is sometimes useful to deal with recursions like (11) with other weights
depending in a simple way on $n$ and $k$ [5]. For instance, if both weights equal 1, then
each solution defines the distribution of an exchangeable 0-1 sequence (see PQ), for which
the Markov chain $(K_n)$ counts the number of 1's among the first $n$ bits. An instructive
exercise is to construct by this method of specifying the triple $f, g, h$ two distinguished
families of exchangeable processes: the homogeneous Bernoulli processes and Pólya's
urns with two colors.